Abstract

This article deals with the Grassmann manifold as a submanifold of the matrix Eu- 
clidean space, that is, as the set of all orthogonal projection matrices of constant rank, 
and sets up several optimization algorithms in terms of such matrices. 

Interest will center on the steepest descent method and Newton's method on the Grass- 
mann manifold together with applications to matrix eigenvalue problems. It is shown that 
Newton's equation in the proposed Newton's method applied to the Rayleigh quotient min- 
imization problem takes the form of a Lyapunov equation, for which an existing efficient 
algorithm can be applied, and thereby the present Newton's method works efficiently. It 
is furthermore shown that in case of degenerate eigenvalues the optimal solutions form 
a submanifold diffeomorphic to a Grassmann manifold of lower dimension. Incidentally,
as is well known, Newton's method does not necessarily generate sequences converging 
to a global optimal solution. To generate globally converging sequences, this article pro- 
vides a hybrid method composed of the steepest descent and Newton's methods on the 
Grassmann manifold together with convergence analysis. If an approximate solution to a 
matrix eigenvalue problem is known by using an existing algorithm, it is already sitting 
in a region of convergence to a global optimal solution, and then the present Newton's 
method can serve to improve the solution. 

Keywords: conjugate gradient method; Riemannian optimization; global convergence; "scaled" 
vector transport; Wolfe conditions 

1 Introduction 



The conjugate gradient method was first developed by Hestenes and Stiefel as a tool for solving
the linear equation $Ax = b$, where $A$ is an $n \times n$ positive definite matrix [5]. The strategy of
the linear conjugate gradient method is to minimize the quadratic function $x^T A x/2 - b^T x$ of $x$
along successive search directions, which are generated in such a manner that they
are mutually conjugate with respect to $A$ and eventually span the whole $\mathbb{R}^n$. Since this method
has been generalized to functions that are not necessarily quadratic in $x$,
the conjugate gradient method in its original form is specifically called the linear conjugate
gradient method.
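
As an illustration of the linear conjugate gradient method described above, the following is a minimal sketch in plain Python. The function names (`dot`, `matvec`, `linear_cg`) and the tolerance are assumptions made for this example only; the recurrences are the standard textbook ones for a symmetric positive definite $A$, not code taken from [5].

```python
def dot(u, v):
    """Euclidean inner product of two lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Product of a dense matrix (list of rows) and a vector."""
    return [dot(row, x) for row in A]


def linear_cg(A, b, tol=1e-12):
    """Minimize x^T A x / 2 - b^T x for symmetric positive definite A.

    Starts from x = 0 and performs at most n steps along mutually
    A-conjugate directions.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x at x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(n):
        if rs <= tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)                      # exact minimizer along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # New direction: residual plus a multiple of the old direction.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

For a $2 \times 2$ system, `linear_cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` needs two steps, one per conjugate direction.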

According to a nonlinear conjugate gradient method for minimizing a smooth function $f$
which is not necessarily quadratic, the search direction $\eta_k$ is determined by

$\eta_k = -\mathrm{grad}\, f(x_k) + \beta_k \eta_{k-1}$,    (1.1)

where $\beta_k$ is a parameter to be defined suitably. Fletcher and Reeves [4] proposed to define
$\beta_k$ by $\beta_k := \|\mathrm{grad}\, f(x_k)\|^2 / \|\mathrm{grad}\, f(x_{k-1})\|^2$ (see [6] for other ways to determine $\beta_k$).

On the other hand, iterative optimization methods on $\mathbb{R}^n$ have been developed so as to be
applicable on Riemannian manifolds [1, 3]. Those generalized methods are called Riemannian
optimization methods, which provide procedures for minimizing objective functions defined
on a Riemannian manifold $M$. In a Riemannian optimization method, the usual line search
should be replaced [1], as the concept of a line is generalized on a Riemannian manifold. Absil,
Mahony, and Sepulchre proposed to use a retraction map to perform a search on a curve on
$M$ in place of the line search. As for the conjugate gradient method, Smith provided in [9]
a conjugate gradient method on $M$ along with other optimization algorithms on $M$. The
difficulty we encounter in generalizing the conjugate gradient method to a manifold
is that Eq. (1.1) no longer makes sense. This is because $\mathrm{grad}\, f(x_k)$ and $\eta_{k-1}$ belong to
tangent spaces at different points on $M$ in general, so that they cannot be added. Smith
proposed to use the parallel translation along the geodesic at each iteration in order to make
possible the addition of two tangent vectors and thereby to extend the iteration procedure
(1.1). However, the parallel translation is not suited to computation on $M$. A
way to perform the conjugate gradient method on $M$ in a computationally efficient manner is
to use a vector transport [1]. The global convergence of the conjugate gradient method with
a vector transport on $M$ has been recently discussed by Ring and Wirth [7]. They proved the
global convergence under the condition that the vector transport in use does not increase the
norm of the search direction vector. However, if this assumption does not hold, the conjugate
gradient method with a general vector transport may fail to generate a globally converging
sequence.

In this paper, the notion of a scaled vector transport is introduced in Section 2 after a brief
review of some useful existing concepts. In Section 3, a brief review is made of the conjugate
gradient method on a Riemannian manifold $M$, and then a new algorithm is proposed, in
which the scaled vector transport is applied only when needed. How to compute the step
size is also discussed in this section. In Section 4, the global convergence of the proposed
algorithm is proved in a manner similar to the usual one performed on $\mathbb{R}^n$, where the scaled
vector transport, used only when needed, makes a generated sequence converge. Section 5
provides numerical experiments on a simple problem which the existing algorithm cannot
solve but the proposed algorithm can. The numerical experiments show why the present
algorithm can generate convergent sequences. Section 6 includes concluding remarks. It is
shown in Appendix A that the Lipschitzian condition referred to in Subsection 4.1 is satisfied
for a practical Riemannian optimization problem.

2 Setting up for Riemannian optimization

2.1 Retraction

An unconstrained optimization problem on a Riemannian manifold $M$ is described as follows:

Problem 2.1.

minimize $f(x)$,    (2.1)
subject to $x \in M$.    (2.2)

If $M$ is the Euclidean space $\mathbb{R}^n$, the line search is performed with the updating formula

$x_{k+1} = x_k + \alpha_k \eta_k$,    (2.3)

where $x_k, x_{k+1} \in \mathbb{R}^n$ are a current point and an unknown next point, respectively, and where
$\eta_k \in \mathbb{R}^n$ and $\alpha_k > 0$ are a search direction at $x_k$ and a step size, respectively. However, the
line search (2.3) does not make sense on a general manifold $M$. In order to generalize the line
search (2.3) on $\mathbb{R}^n$ to that on $M$, the search direction $\eta_k$ should be taken as a tangent vector
in $T_{x_k}M$, and the addition in Eq. (2.3) should be replaced by another operation. A natural
alternative to the line search is a search along the geodesic emanating from $x_k$ in the direction
of $\eta_k$, but the geodesic will cause computational difficulty. A computationally efficient way is
to use the following retraction map introduced in [1].

Definition 2.1. Let $M$ and $TM$ be a manifold and the tangent bundle of $M$, respectively.
Let $R : TM \to M$ be a smooth map and $R_x$ the restriction of $R$ to $T_xM$. Then $R$ is called a
retraction on $M$, if it has the following properties:

1. $R_x(0_x) = x$, where $0_x$ denotes the zero element of $T_xM$.

2. With the canonical identification $T_{0_x}T_xM \simeq T_xM$, $R_x$ satisfies

$\mathrm{D}R_x(0_x) = \mathrm{id}_{T_xM}$,    (2.4)

where $\mathrm{D}R_x(0_x)$ denotes the derivative of $R_x$ at $0_x$, and $\mathrm{id}_{T_xM}$ the identity map on $T_xM$.

As is easily seen, the exponential map on $M$ is a typical example of a retraction. If we
can find a computationally preferable retraction, we can perform an optimization procedure
as follows:

Algorithm 2.1 The general framework of optimization methods for Problem 2.1 on a Riemannian manifold $M$

1: Choose an initial point $x_0 \in M$.
2: for $k = 0, 1, 2, \ldots$ do
3: Compute the search direction $\eta_k \in T_{x_k}M$ and the step size $\alpha_k > 0$.
4: Compute the next iterate by $x_{k+1} := R_{x_k}(\alpha_k \eta_k)$, where $R$ is a retraction on $M$.
5: end for

The choice of a search direction and a step size characterizes the individual optimization
method. We proceed to the vector transport in search of computationally efficient conjugate
gradient methods.
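
To make the framework of Algorithm 2.1 concrete, the following is a minimal sketch on the unit sphere with the standard (Euclidean) metric, using the normalization map as the retraction, the steepest descent direction, and a fixed step size. All names (`retract`, `riemannian_grad`, `descent`) and the choices of step size and iteration count are assumptions for illustration; the article itself does not prescribe them.

```python
import math


def retract(x, xi):
    """Retraction R_x(xi) = (x + xi) / ||x + xi|| on the unit sphere."""
    y = [a + b for a, b in zip(x, xi)]
    nrm = math.sqrt(sum(c * c for c in y))
    return [c / nrm for c in y]


def riemannian_grad(diag_a, x):
    """Gradient of f(x) = x^T diag(a) x for the standard metric.

    This is the orthogonal projection of the Euclidean gradient 2Ax
    onto the tangent space {xi : x^T xi = 0}.
    """
    g = [2.0 * a * c for a, c in zip(diag_a, x)]
    gx = sum(gi * xi for gi, xi in zip(g, x))
    return [gi - gx * xi for gi, xi in zip(g, x)]


def descent(diag_a, x0, step=0.1, iters=500):
    """Algorithm 2.1 with eta_k = -grad f(x_k) and a constant step size."""
    x = list(x0)
    for _ in range(iters):
        eta = [-g for g in riemannian_grad(diag_a, x)]   # line 3: direction
        x = retract(x, [step * e for e in eta])          # line 4: next iterate
    return x
```

With `diag_a = [1.0, 2.0, 3.0]` and a starting point with positive first component, the iterates move toward the eigenvector $(1, 0, 0)^T$ belonging to the smallest eigenvalue. A constant step is used only to keep the sketch short; the step-size rules discussed in Section 3 are the ones relevant to the convergence analysis.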

2.2 Vector transport and scaled vector transport

In a (nonlinear) conjugate gradient method on the Euclidean space $\mathbb{R}^n$, the search directions
$\eta_k$ are chosen to be

$\eta_k = -\mathrm{grad}\, f(x_k) + \beta_k \eta_{k-1}$,  $k \ge 0$,    (2.5)

where $\beta_0 = 0$, and where $\beta_k$ with $k \ge 1$ are determined in several possible manners. For
example, $\beta_k$ are determined by

$\beta_k^{\mathrm{FR}} = \dfrac{\mathrm{grad}\, f(x_k)^T \mathrm{grad}\, f(x_k)}{\mathrm{grad}\, f(x_{k-1})^T \mathrm{grad}\, f(x_{k-1})}$,    (2.6)

or

$\beta_k^{\mathrm{PR}} = \dfrac{\mathrm{grad}\, f(x_k)^T \left(\mathrm{grad}\, f(x_k) - \mathrm{grad}\, f(x_{k-1})\right)}{\mathrm{grad}\, f(x_{k-1})^T \mathrm{grad}\, f(x_{k-1})}$,    (2.7)

where FR and PR are abbreviations of Fletcher-Reeves and Polak-Ribiere, respectively [6].

However, $\mathrm{grad}\, f(x_k) \in T_{x_k}M$ and $\eta_{k-1} \in T_{x_{k-1}}M$ belong to different tangent spaces, so
that $-\mathrm{grad}\, f(x_k) + \beta_k \eta_{k-1}$ in Eq. (2.5) does not make sense, if $\mathbb{R}^n$ is replaced by a Riemannian
manifold $M$. The quantity $\mathrm{grad}\, f(x_k) - \mathrm{grad}\, f(x_{k-1})$ in Eq. (2.7) makes no sense on $M$
either. In order to make the vector addition in Eqs. (2.5) and (2.7) meaningful on $M$, Smith
proposed to use the parallel translation of tangent vectors along a geodesic [9]. However, no
computationally efficient formula is known for the parallel translation along a geodesic even
for the Stiefel manifold. Absil et al. [1] proposed the notion of a vector transport as an
alternative to the parallel translation as follows:

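Definition 2.2. A vector transport $\mathcal{T}$ on a manifold $M$ is a smooth map

$TM \oplus TM \to TM : (\eta_x, \xi_x) \mapsto \mathcal{T}_{\eta_x}(\xi_x) \in TM$    (2.8)

satisfying the following properties for all $x \in M$:

1. There exists a retraction $R$, called the retraction associated with $\mathcal{T}$, such that

$\pi\left(\mathcal{T}_{\eta_x}(\xi_x)\right) = R_x(\eta_x)$,    (2.9)

where $\pi\left(\mathcal{T}_{\eta_x}(\xi_x)\right)$ denotes the foot of the tangent vector $\mathcal{T}_{\eta_x}(\xi_x)$,

2. $\mathcal{T}_{0_x}(\xi_x) = \xi_x$ for all $\xi_x \in T_xM$,

3. $\mathcal{T}_{\eta_x}(a\xi_x + b\zeta_x) = a\mathcal{T}_{\eta_x}(\xi_x) + b\mathcal{T}_{\eta_x}(\zeta_x)$.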

In what follows, we assume that $M$ is a Riemannian manifold and denote the Riemannian
metric evaluated at $x \in M$ by $\langle \cdot, \cdot \rangle_x$. The vector transport is a generalization of the parallel
translation and can enhance computational efficiency of algorithms, if defined suitably.

We here have to note that though the parallel translation is an isometry, a vector transport
is not required to preserve the norm of vectors in general. In analysing the convergence of
the conjugate gradient method later, it will be crucial whether the vector transport increases
the norm of vectors or not. In order to make a given vector transport $\mathcal{T}$ isometric, we define
the scaled vector transport $\mathcal{T}^0 : TM \oplus TM \to TM$ associated with $\mathcal{T}$ as

$\mathcal{T}^0_{\eta_x}(\xi_x) = \dfrac{\|\xi_x\|_x}{\|\mathcal{T}_{\eta_x}(\xi_x)\|_{R_x(\eta_x)}}\, \mathcal{T}_{\eta_x}(\xi_x)$,  $\eta_x, \xi_x \in T_xM$,    (2.10)

where $\|\mathcal{T}_{\eta_x}(\xi_x)\|_{R_x(\eta_x)}$ denotes the norm of $\mathcal{T}_{\eta_x}(\xi_x)$ evaluated at $R_x(\eta_x)$. The scaled
vector transport $\mathcal{T}^0$ thus defined no longer satisfies the third condition of Definition 2.2, but
$\mathcal{T}^0$ is isometric in the sense

$\|\mathcal{T}^0_{\eta_x}(\xi_x)\|_{R_x(\eta_x)} = \|\xi_x\|_x$,  $\eta_x, \xi_x \in T_xM$,    (2.11)

where $R$ is the associated retraction.
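
The scaling that turns a vector transport into an isometric map can be sketched as follows. The example assumes the unit sphere with the standard metric, the normalization retraction, and its derivative as the underlying vector transport; the names `diff_retraction` and `scaled_transport` are hypothetical and chosen only for this illustration.

```python
import math


def norm(v):
    """Norm induced by the standard metric (an assumption of this sketch)."""
    return math.sqrt(sum(c * c for c in v))


def retract(x, eta):
    """Retraction R_x(eta) = (x + eta) / ||x + eta|| on the unit sphere."""
    w = [a + b for a, b in zip(x, eta)]
    nw = norm(w)
    return [c / nw for c in w]


def diff_retraction(x, eta, xi):
    """Vector transport T_eta(xi) = DR_x(eta)[xi] for the retraction above."""
    w = [a + b for a, b in zip(x, eta)]
    nw = norm(w)
    y = [c / nw for c in w]
    y_xi = sum(a * b for a, b in zip(y, xi))
    # Derivative of w -> w/||w||: remove the component along y, then rescale.
    return [(b - y_xi * a) / nw for a, b in zip(y, xi)]


def scaled_transport(x, eta, xi):
    """Scaled transport: same direction as T_eta(xi), same norm as xi."""
    t = diff_retraction(x, eta, xi)
    nt = norm(t)
    if nt == 0.0:            # nothing to rescale (guard added for the sketch)
        return t
    s = norm(xi) / nt
    return [s * c for c in t]
```

For `x = [1, 0, 0]`, `eta = [0, 1, 0]`, and `xi = [0, 0.6, 0.8]`, the unscaled transport shortens `xi`, while the scaled one keeps its norm. Because the scaling factor depends on `xi`, the scaled map is not additive in `xi`, which is the loss of the third condition of Definition 2.2 mentioned above.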



3 A conjugate gradient method on a Riemannian manifold 

If a Riemannian manifold $M$ is given a retraction and a vector transport, a Fletcher-Reeves
type conjugate gradient method on $M$ is described as follows [1, 7]:



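Algorithm 3.1 A Fletcher-Reeves type conjugate gradient method for Problem 2.1 on a Riemannian manifold $M$

1: Choose an initial point $x_0 \in M$.
2: Set $\eta_0 = -\mathrm{grad}\, f(x_0)$.
3: for $k = 0, 1, 2, \ldots$ do
4: Compute the step size $\alpha_k > 0$ and set

$x_{k+1} = R_{x_k}(\alpha_k \eta_k)$,    (3.1)

where $R$ is a retraction on $M$.
5: Set

$\beta_{k+1} = \dfrac{\langle \mathrm{grad}\, f(x_{k+1}), \mathrm{grad}\, f(x_{k+1}) \rangle_{x_{k+1}}}{\langle \mathrm{grad}\, f(x_k), \mathrm{grad}\, f(x_k) \rangle_{x_k}}$,    (3.2)

$\eta_{k+1} = -\mathrm{grad}\, f(x_{k+1}) + \beta_{k+1} \mathcal{T}_{\alpha_k \eta_k}(\eta_k)$,    (3.3)

where $\mathcal{T}$ is a (scaled) vector transport on $M$.
6: end for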



We here note that in Algorithm 3.1, $\mathcal{T}$ is not necessarily a scaled vector transport.

Given a (scaled) vector transport along with an associated retraction on $M$, we need to
choose a step size in each iteration to perform Algorithm 3.1.

In computing the step size $\alpha_k$ in the conjugate gradient method on $\mathbb{R}^n$, the strong Wolfe
conditions are often used [6], which require $\alpha_k$ to satisfy

$f(x_k + \alpha_k \eta_k) \le f(x_k) + c_1 \alpha_k\, \mathrm{grad}\, f(x_k)^T \eta_k$,    (3.4)

$|\mathrm{grad}\, f(x_k + \alpha_k \eta_k)^T \eta_k| \le c_2\, |\mathrm{grad}\, f(x_k)^T \eta_k|$,    (3.5)

with $0 < c_1 < c_2 < 1$. In particular, $c_1$ and $c_2$ are often taken to satisfy $0 < c_1 < c_2 < 1/2$
in the conjugate gradient method. In order to extend the strong Wolfe conditions on $\mathbb{R}^n$ to
those on $M$, we start by reviewing the strong Wolfe conditions (3.4) and (3.5). For a current
point $x_k$ and a search direction $\eta_k$, one performs a line search for the function $\phi$ defined by

$\phi(\alpha) = f(x_k + \alpha \eta_k)$,  $\alpha > 0$.    (3.6)

Requiring $\alpha_k$ to give a sufficient decrease in the value of $f$, one imposes the condition

$\phi(\alpha_k) \le \phi(0) + c_1 \alpha_k \phi'(0)$,    (3.7)

which yields (3.4). In order to prevent $\alpha_k$ from being excessively short, $\alpha_k$ is required to
satisfy

$|\phi'(\alpha_k)| \le c_2 |\phi'(0)|$,    (3.8)

which implies (3.5).

In order to generalize the strong Wolfe conditions to those on $M$, we define a function
$\phi$, in an analogous manner to (3.6), to be

$\phi(\alpha) = f\left(R_{x_k}(\alpha \eta_k)\right)$,  $\alpha > 0$,    (3.9)

where $R$ is a retraction on $M$. The conditions (3.7) and (3.8) applied to (3.9) give rise to

$f\left(R_{x_k}(\alpha_k \eta_k)\right) \le f(x_k) + c_1 \alpha_k \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}$,    (3.10)

$\left|\langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathrm{D}R_{x_k}(\alpha_k \eta_k)[\eta_k] \rangle_{R_{x_k}(\alpha_k \eta_k)}\right| \le c_2 \left|\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}\right|$,    (3.11)

respectively, where $0 < c_1 < c_2 < 1$.

We here look into the second condition (3.11). If we introduce a vector transport $\mathcal{T}^R$ as
the differentiated retraction, which is given by

$\mathcal{T}^R_{\eta_x}(\xi_x) := \mathrm{D}R_x(\eta_x)[\xi_x]$,  $\eta_x, \xi_x \in T_xM$,    (3.12)

then Eq. (3.11) can be expressed as

$\left|\langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathcal{T}^R_{\alpha_k \eta_k}(\eta_k) \rangle_{R_{x_k}(\alpha_k \eta_k)}\right| \le c_2 \left|\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}\right|$.    (3.13)

We here note that $\mathcal{T}^R$ satisfies the conditions in Definition 2.2, as is easily verified [1]. Though
the inequality (3.13) is stated for the vector transport $\mathcal{T}^R$, computational efficiency may
increase if $\mathcal{T}^R$ can be replaced by another vector transport $\mathcal{T}$ in (3.13). With this in mind, we
consider the following condition in place of (3.13):

$\left|\langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathcal{T}_{\alpha_k \eta_k}(\eta_k) \rangle_{R_{x_k}(\alpha_k \eta_k)}\right| \le c_2 \left|\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}\right|$.    (3.14)

We call the conditions (3.10) and (3.14) the $\mathcal{T}$-strong Wolfe conditions. We note that $\mathcal{T}$ is not
restricted to a vector transport but may be taken as a scaled vector transport. If $\mathcal{T} = \mathcal{T}^R$,
the $\mathcal{T}^R$-strong Wolfe conditions coincide with the strong Wolfe conditions.

Returning to the strong Wolfe conditions (3.10) and (3.11), we show the existence of a
step size satisfying (3.10) and (3.11) by the same method as that applied to the
strong Wolfe conditions on $\mathbb{R}^n$ (see [6]).

Proposition 3.1. Let $M$ be a Riemannian manifold with a retraction $R$. If a smooth objective
function $f$ on $M$ is bounded below on $\{R_{x_k}(\alpha \eta_k) \mid \alpha > 0\}$ for $x_k \in M$ and for a descent
direction $\eta_k \in T_{x_k}M$, and if constants $c_1$ and $c_2$ satisfy $0 < c_1 < c_2 < 1$, then there exists a
step size $\alpha_k$ which satisfies the strong Wolfe conditions (3.10) and (3.11).

Proof. Define $\phi$ as in (3.9). Then, on account of the assumption in Prop. 3.1, $\phi$ is bounded
below on $(0, \infty)$, hence there exists $\alpha > 0$ such that

$\phi(\alpha) = \phi(0) + c_1 \phi'(0) \alpha$,    (3.15)

where we have used the fact that $0 < c_1 < 1$. Let $\alpha'$ be the minimum among the $\alpha$'s satisfying
(3.15). Then, for $0 < \alpha < \alpha'$, one has

$\phi(\alpha) < \phi(0) + c_1 \phi'(0) \alpha$,    (3.16)

which implies (3.10).

According to the mean value theorem for $\phi$, there exists $\alpha'' \in (0, \alpha')$ such that

$\phi(\alpha') - \phi(0) = \alpha' \phi'(\alpha'')$.    (3.17)

Since $\eta_k$ is a descent direction, we have $\phi'(0) < 0$. On account of this, Eqs. (3.17) and (3.15)
with $\alpha = \alpha'$ are put together to yield

$\phi'(\alpha'') = c_1 \phi'(0) > c_2 \phi'(0)$.    (3.18)

Since both $\phi'(0)$ and $\phi'(\alpha'')$ are negative, it follows that

$|\phi'(\alpha'')| < c_2 |\phi'(0)|$.    (3.19)

If we take $\alpha_k := \alpha'' > 0$, Eqs. (3.16) and (3.19) imply that $\alpha_k$ satisfies (3.7) and (3.8). It then
turns out that the $\alpha_k$ thus found satisfies the strong Wolfe conditions (3.10) and (3.11). This
ends the proof. □
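
The existence argument above suggests a simple numerical search for a step size satisfying the strong Wolfe conditions along the curve $\alpha \mapsto R_{x_k}(\alpha\eta_k)$. The sketch below is one possible realization, not the procedure used in the article: it assumes the unit sphere with the standard metric, the normalization retraction, the values $c_1 = 10^{-4}$ and $c_2 = 0.1$, and a plain doubling and bisection strategy. All function names are hypothetical.

```python
import math

A = [1.0, 2.0, 3.0]   # f(x) = x^T diag(A) x on the unit sphere (assumed example)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def f(x):
    return sum(a * c * c for a, c in zip(A, x))


def grad(x):
    """Riemannian gradient for the standard metric (projection of 2Ax)."""
    g = [2.0 * a * c for a, c in zip(A, x)]
    s = dot(g, x)
    return [gi - s * xi for gi, xi in zip(g, x)]


def curve(x, eta, alpha):
    """Return R_x(alpha*eta) and DR_x(alpha*eta)[eta] for R_x(xi) = (x+xi)/||x+xi||."""
    w = [a + alpha * b for a, b in zip(x, eta)]
    n = math.sqrt(dot(w, w))
    y = [c / n for c in w]
    s = dot(y, eta)
    return y, [(b - s * a) / n for a, b in zip(y, eta)]


def phi(x, eta, alpha):
    """phi(alpha) = f(R_x(alpha*eta)), cf. (3.9)."""
    return f(curve(x, eta, alpha)[0])


def dphi(x, eta, alpha):
    """phi'(alpha): inner product of grad f with the differentiated retraction."""
    y, dy = curve(x, eta, alpha)
    return dot(grad(y), dy)


def strong_wolfe_step(x, eta, c1=1e-4, c2=0.1, max_iter=100):
    """Doubling/bisection search for a step satisfying (3.7) and (3.8)."""
    phi0, d0 = phi(x, eta, 0.0), dphi(x, eta, 0.0)
    lo, hi, alpha = 0.0, None, 1.0
    for _ in range(max_iter):
        p, d = phi(x, eta, alpha), dphi(x, eta, alpha)
        if p > phi0 + c1 * alpha * d0 or d > c2 * abs(d0):
            hi = alpha            # step too long
        elif d < -c2 * abs(d0):
            lo = alpha            # step too short
        else:
            return alpha          # both conditions hold
        alpha = 0.5 * (lo + hi) if hi is not None else 2.0 * alpha
    raise RuntimeError("no strong Wolfe step found")
```

The search keeps a lower end at which the sufficient decrease condition holds with a steep negative slope, and an upper end at which either the decrease condition fails or the slope is too positive; by the same reasoning as in the proof, such an interval contains admissible step sizes.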

We note that the strong Wolfe conditions (3.10) and (3.11) and the existence of a step size
which satisfies them are also discussed in [7]. We can modify Prop. 3.1 so that the step size
$\alpha_k$ may satisfy the $\mathcal{T}^0$-strong Wolfe conditions, where $\mathcal{T}^0$ denotes the scaled vector transport
associated with the vector transport given in (3.12).

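Proposition 3.2. Let $M$ and $f$ be the same as in Prop. 3.1. Let a scaled vector transport
$\mathcal{T}^0$ be defined to be

$\mathcal{T}^0_{\eta}(\xi) = \dfrac{\|\xi\|_x}{\|\mathrm{D}R_x(\eta)[\xi]\|_{R_x(\eta)}}\, \mathrm{D}R_x(\eta)[\xi]$,  $x \in M$,  $\eta, \xi \in T_xM$.    (3.20)

Assume that there exists a positive constant $m$ such that

$\dfrac{\|\xi\|_x}{\|\mathrm{D}R_x(\eta)[\xi]\|_{R_x(\eta)}} \le m$,  $\eta, \xi \in T_xM$,    (3.21)

for any $x \in M$. If constants $c_1$ and $c_2$ satisfy $0 < c_1 < c_2/m < 1$, then there exists a step size
$\alpha_k$ which satisfies the $\mathcal{T}^0$-strong Wolfe conditions.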

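Proof. Since $0 < c_1 < c_2/m < 1$, Prop. 3.1 implies that there exists $\alpha_k > 0$ such that

$f\left(R_{x_k}(\alpha_k \eta_k)\right) \le f(x_k) + c_1 \alpha_k \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}$,    (3.22)

$\left|\langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathrm{D}R_{x_k}(\alpha_k \eta_k)[\eta_k] \rangle_{R_{x_k}(\alpha_k \eta_k)}\right| \le \dfrac{c_2}{m} \left|\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}\right|$.    (3.23)

From (3.23) together with (3.20) and (3.21), we have

$\left|\langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathcal{T}^0_{\alpha_k \eta_k}(\eta_k) \rangle_{R_{x_k}(\alpha_k \eta_k)}\right|$
$= \dfrac{\|\eta_k\|_{x_k}}{\|\mathrm{D}R_{x_k}(\alpha_k \eta_k)[\eta_k]\|_{R_{x_k}(\alpha_k \eta_k)}} \left|\langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathrm{D}R_{x_k}(\alpha_k \eta_k)[\eta_k] \rangle_{R_{x_k}(\alpha_k \eta_k)}\right|$
$\le m \cdot \dfrac{c_2}{m} \left|\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}\right| = c_2 \left|\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}\right|$.    (3.24)

This completes the proof. □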

If we know the value of $m$ mentioned in Prop. 3.2, the $\mathcal{T}^0$-strong Wolfe step size with the
scaled vector transport $\mathcal{T}^0$ defined by (3.20) is computed in the same manner as that for the
strong Wolfe step size. However, even if the constant $m$ mentioned in Prop. 3.2 exists, it is often
difficult to know its specific value. Therefore, as for the $\mathcal{T}^0$-strong Wolfe conditions, we
cannot in general fix in advance values of $c_1$ and $c_2$ satisfying $0 < c_1 < c_2/m < 1$.

Taking all these things into account, we are led to the idea that each step size is always
computed so as to satisfy the strong Wolfe conditions (3.10) and (3.11), but the scaled vector
transport is adopted only if it is necessary for the purpose of convergence. This idea is realized in
the following algorithm.

Algorithm 3.2 A new conjugate gradient method for Problem 2.1 on a Riemannian manifold $M$

1: Choose an initial point $x_0 \in M$.
2: Set $\eta_0 = -\mathrm{grad}\, f(x_0)$.
3: for $k = 0, 1, 2, \ldots$ do
4: Compute the step size $\alpha_k > 0$ satisfying the strong Wolfe conditions (3.10) and (3.11). Set

$x_{k+1} = R_{x_k}(\alpha_k \eta_k)$,    (3.25)

where $R$ is a retraction on $M$.
5: if $\|\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} \le \|\eta_k\|_{x_k}$ then
6: Set $\mathcal{T}^{(k)} = \mathcal{T}^R$, where $\mathcal{T}^R$ is the vector transport (3.12) as the differentiated retraction on $M$.
7: else
8: Set $\mathcal{T}^{(k)} = \mathcal{T}^0$, where $\mathcal{T}^0$ is the scaled vector transport (2.10) associated with $\mathcal{T}^R$.
9: end if
10: Set

$\beta_{k+1} = \dfrac{\langle \mathrm{grad}\, f(x_{k+1}), \mathrm{grad}\, f(x_{k+1}) \rangle_{x_{k+1}}}{\langle \mathrm{grad}\, f(x_k), \mathrm{grad}\, f(x_k) \rangle_{x_k}}$,    (3.26)

$\eta_{k+1} = -\mathrm{grad}\, f(x_{k+1}) + \beta_{k+1} \mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k)$.    (3.27)

11: end for

4 Convergence analysis of the new algorithm

In [7], the convergence property of Algorithm 3.1 with the strong Wolfe conditions for
$0 < c_1 < c_2 < 1/2$ is shown under the assumption that $\|\mathcal{T}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} \le \|\eta_k\|_{x_k}$ for all $k \in \mathbb{N}$.
We wish to relax this assumption by using a scaled vector transport if necessary, while keeping the
convergence property. We will show in Section 5 that there exist numerical
examples in which the above inequalities do not hold for all $k \in \mathbb{N}$ but our Algorithm 3.2
indeed has an advantage in generating convergent sequences. In what follows, we analyse the
convergence property of Algorithm 3.2.



4.1 Zoutendijk's theorem 

Zoutendijk's theorem about a series associated with search directions on $\mathbb{R}^n$ is valid not only
for the conjugate gradient method but also for general descent algorithms [6]. We
generalize the theorem so as to be applicable to a general descent algorithm (Algorithm 2.1)
on a Riemannian manifold $M$. In the same manner as in $\mathbb{R}^n$, we define on a Riemannian
manifold $M$ the angle $\theta_k$ between the steepest descent direction $-\mathrm{grad}\, f(x_k)$ and the search
direction $\eta_k$ through

$\cos\theta_k = -\dfrac{\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}}{\|\mathrm{grad}\, f(x_k)\|_{x_k} \|\eta_k\|_{x_k}}$.    (4.1)

Theorem 4.1. Consider Algorithm 2.1 on a Riemannian manifold $M$. Let $\eta_k$ be a descent
direction and $\alpha_k$ satisfy the strong Wolfe conditions (3.10) and (3.11). Assume that the objective
function $f$ is bounded below and of $C^1$-class, and further that there exists a Lipschitzian
constant $L > 0$ such that

$\left|\mathrm{D}(f \circ R_x)(t\eta)[\eta] - \mathrm{D}(f \circ R_x)(0)[\eta]\right| \le Lt$,  $\eta \in T_xM$ with $\|\eta\|_x = 1$,  $x \in M$,  $t \ge 0$.    (4.2)

Then, the following series converges:

$\sum_{k=0}^{\infty} \cos^2\theta_k\, \|\mathrm{grad}\, f(x_k)\|_{x_k}^2 < \infty$.    (4.3)


Remark 4.1. Before proving this theorem, we remark that the inequality (4.2) is of practical
use in spite of its appearance. We will show in Appendix A that Eq. (4.2) holds for objective
functions in practical Riemannian optimization problems.



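Proof of Thm. 4.1. From the strong Wolfe condition (3.11) and the Lipschitzian condition
(4.2) together with $\eta_k$ being a descent direction, we have

$(c_2 - 1) \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}$
$\le \langle \mathrm{grad}\, f\left(R_{x_k}(\alpha_k \eta_k)\right), \mathrm{D}R_{x_k}(\alpha_k \eta_k)[\eta_k] \rangle_{x_{k+1}} - \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}$
$= \mathrm{D}(f \circ R_{x_k})(\alpha_k \eta_k)[\eta_k] - \mathrm{D}(f \circ R_{x_k})(0)[\eta_k] \le \alpha_k L \|\eta_k\|_{x_k}^2$,    (4.4)

where the rescaling $\eta_k \mapsto \eta_k / \|\eta_k\|_{x_k}$ has been taken into account in applying the inequality
(4.2). From this, we obtain

$\alpha_k \ge \dfrac{(c_2 - 1) \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}}{L \|\eta_k\|_{x_k}^2}$.    (4.5)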

Since $f$ is bounded below, there exists a constant $C$ such that $f(x) > C$ for any $x \in M$. It
then follows from (3.10) and (4.5) that

$C < f(x_{k+1}) \le f(x_k) + c_1 \alpha_k \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}$
$\le f(x_k) - c \cos^2\theta_k\, \|\mathrm{grad}\, f(x_k)\|_{x_k}^2$
$\le f(x_0) - c \sum_{j=0}^{k} \cos^2\theta_j\, \|\mathrm{grad}\, f(x_j)\|_{x_j}^2$    (4.6)

for any $k$, where $c := c_1(1 - c_2)/L > 0$, and where the inequality in the second row has been used
successively. Eq. (4.6) results in

$\sum_{k=0}^{\infty} \cos^2\theta_k\, \|\mathrm{grad}\, f(x_k)\|_{x_k}^2 \le \dfrac{f(x_0) - C}{c} < \infty$.    (4.7)

This completes the proof. □

4.2 Global convergence 

We first extend a lemma in [2] so as to be applicable to Algorithm 3.2 as follows:

Lemma 4.1. The search direction $\eta_k$ determined in Algorithm 3.2 with $0 < c_1 < c_2 < 1/2$ is
a descent direction satisfying

$-\dfrac{1}{1 - c_2} \le \dfrac{\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}}{\|\mathrm{grad}\, f(x_k)\|_{x_k}^2} \le \dfrac{2c_2 - 1}{1 - c_2}$.    (4.8)

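Proof. The proof runs by induction. For $k = 0$, (4.8) clearly holds on account of

$\dfrac{\langle \mathrm{grad}\, f(x_0), \eta_0 \rangle_{x_0}}{\|\mathrm{grad}\, f(x_0)\|_{x_0}^2} = \dfrac{\langle \mathrm{grad}\, f(x_0), -\mathrm{grad}\, f(x_0) \rangle_{x_0}}{\|\mathrm{grad}\, f(x_0)\|_{x_0}^2} = -1$.    (4.9)

Suppose that $\eta_k$ is a descent direction satisfying (4.8) for some $k$. Note that by the definition
of $\mathcal{T}^{(k)}$ in Algorithm 3.2 we have

$\mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k) = \begin{cases} \mathcal{T}^R_{\alpha_k \eta_k}(\eta_k), & \text{if } \|\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} \le \|\eta_k\|_{x_k}, \\ \mathcal{T}^0_{\alpha_k \eta_k}(\eta_k), & \text{otherwise}, \end{cases}$    (4.10)

so that $\mathcal{T}^R$ and $\mathcal{T}^{(k)}$ are related by $\|\mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} \le \|\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}}$ in each case. Indeed,
if $\|\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} > \|\eta_k\|_{x_k}$, then $\|\mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}}$ is evaluated as

$\|\mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} = \|\mathcal{T}^0_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} = \|\eta_k\|_{x_k} < \|\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}}$.    (4.11)

Since $\mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k)$ and $\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)$ are in the same direction with the inequality
$\|\mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} \le \|\mathcal{T}^R_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}}$ in norm, we have

$\left|\langle \mathrm{grad}\, f(x_{k+1}), \mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k) \rangle_{x_{k+1}}\right| \le \left|\langle \mathrm{grad}\, f(x_{k+1}), \mathcal{T}^R_{\alpha_k \eta_k}(\eta_k) \rangle_{x_{k+1}}\right|$.    (4.12)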
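
We also note that the vector transport $\mathcal{T}^R$ is defined to be $\mathcal{T}^R_{\eta_x}(\xi_x) = \mathrm{D}R_x(\eta_x)[\xi_x]$ in the
algorithm. It then follows from (3.11) and (4.12) that

$c_2 \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k} \le \langle \mathrm{grad}\, f(x_{k+1}), \mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k) \rangle_{x_{k+1}} \le -c_2 \langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}$,    (4.13)

where it is to be noted that $\eta_k$ is a descent direction. The middle term in (4.8) with $k + 1$
for $k$ is computed as

$\dfrac{\langle \mathrm{grad}\, f(x_{k+1}), \eta_{k+1} \rangle_{x_{k+1}}}{\|\mathrm{grad}\, f(x_{k+1})\|_{x_{k+1}}^2} = \dfrac{\langle \mathrm{grad}\, f(x_{k+1}), -\mathrm{grad}\, f(x_{k+1}) + \beta_{k+1} \mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k) \rangle_{x_{k+1}}}{\|\mathrm{grad}\, f(x_{k+1})\|_{x_{k+1}}^2}$
$= -1 + \dfrac{\langle \mathrm{grad}\, f(x_{k+1}), \mathcal{T}^{(k)}_{\alpha_k \eta_k}(\eta_k) \rangle_{x_{k+1}}}{\|\mathrm{grad}\, f(x_k)\|_{x_k}^2}$,    (4.14)

where the definition (3.26) of $\beta_{k+1}$ has been used. Therefore, we obtain from (4.13) and (4.14)

$-1 + c_2 \dfrac{\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}}{\|\mathrm{grad}\, f(x_k)\|_{x_k}^2} \le \dfrac{\langle \mathrm{grad}\, f(x_{k+1}), \eta_{k+1} \rangle_{x_{k+1}}}{\|\mathrm{grad}\, f(x_{k+1})\|_{x_{k+1}}^2} \le -1 - c_2 \dfrac{\langle \mathrm{grad}\, f(x_k), \eta_k \rangle_{x_k}}{\|\mathrm{grad}\, f(x_k)\|_{x_k}^2}$.    (4.15)

The inequality (4.8) for $k + 1$ immediately follows from the induction hypothesis. □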
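
For convenience, the final induction step can be written out explicitly. The following is a routine check added here, using only the left inequality of (4.8) for $k$ and the two-sided estimate obtained from (4.13) and (4.14); the abbreviations $g_k := \mathrm{grad}\, f(x_k)$ and $\rho_k$ are introduced for this display only.

```latex
% Abbreviation: \rho_k := <g_k, \eta_k>_{x_k} / \|g_k\|_{x_k}^2, with g_k := grad f(x_k).
% Induction hypothesis (left part of (4.8)): \rho_k \ge -1/(1 - c_2), and 0 < c_2 < 1/2.
\begin{align*}
\rho_{k+1} &\ge -1 + c_2 \rho_k \ge -1 - \frac{c_2}{1 - c_2} = -\frac{1}{1 - c_2}, \\
\rho_{k+1} &\le -1 - c_2 \rho_k \le -1 + \frac{c_2}{1 - c_2} = \frac{2c_2 - 1}{1 - c_2} < 0 .
\end{align*}
```

The last strict inequality uses $c_2 < 1/2$ and shows that $\eta_{k+1}$ is again a descent direction.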

We proceed to the global convergence property of Algorithm 3.2. The convergence of the
conjugate gradient method has already been proved on $\mathbb{R}^n$ by Al-Baali [2]. Exploiting the
idea of the proof used in [2], we show that Algorithm 3.2 generates converging sequences on
a Riemannian manifold.

Theorem 4.2. Consider Algorithm 3.2 with the strong Wolfe conditions (3.10) and (3.11)
for $0 < c_1 < c_2 < 1/2$. If (4.2) and hence (4.3) hold, then

$\liminf_{k \to \infty} \|\mathrm{grad}\, f(x_k)\|_{x_k} = 0$.    (4.16)


Proof. If $\mathrm{grad}\, f(x_k) = 0$ for some $k$, let $k_0$ be the smallest integer among such $k$. Then,
we have $\beta_{k_0} = 0$ and $\eta_{k_0} = 0$ from (3.26) and (3.27) with $k_0 = k + 1$, so that
$x_{k_0+1} = R_{x_{k_0}}(\alpha_{k_0} \eta_{k_0}) = R_{x_{k_0}}(0) = x_{k_0}$. It then follows that $\mathrm{grad}\, f(x_k) = 0$ for all $k \ge k_0$. Eq. (4.16)
clearly holds in such a case.

We shall consider the case in which $\mathrm{grad}\, f(x_k) \ne 0$ for all $k$ and prove (4.16) by
contradiction. Assume that (4.16) does not hold, that is, there exist a natural number $K \in \mathbb{N}$ and
a constant $\gamma' > 0$ such that $\|\mathrm{grad}\, f(x_k)\|_{x_k} > \gamma'$ for all $k > K$. Since $\mathrm{grad}\, f(x_k) \ne 0$ for all $k$,
setting $\gamma := \min\{\gamma', \|\mathrm{grad}\, f(x_k)\|_{x_k},\ k = 0, \ldots, K\} > 0$, we have

$\|\mathrm{grad}\, f(x_k)\|_{x_k} \ge \gamma > 0$,  $\forall k \ge 0$.    (4.17)

Now from (4.1) and (4.8), we obtain

$\cos\theta_k \ge \dfrac{1 - 2c_2}{1 - c_2} \dfrac{\|\mathrm{grad}\, f(x_k)\|_{x_k}}{\|\eta_k\|_{x_k}}$.    (4.18)

On account of Thm. 4.1, Eqs. (4.3) and (4.18) are put together to provide

$\sum_{k=0}^{\infty} \dfrac{\|\mathrm{grad}\, f(x_k)\|_{x_k}^4}{\|\eta_k\|_{x_k}^2} < \infty$.    (4.19)
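
On the other hand, Eqs. (4.12), (4.8), and the strong Wolfe condition (3.11) are put together
to give

$\left|\langle \mathrm{grad}\, f(x_k), \mathcal{T}^{(k-1)}_{\alpha_{k-1} \eta_{k-1}}(\eta_{k-1}) \rangle_{x_k}\right| \le \left|\langle \mathrm{grad}\, f(x_k), \mathcal{T}^R_{\alpha_{k-1} \eta_{k-1}}(\eta_{k-1}) \rangle_{x_k}\right|$
$\le -c_2 \langle \mathrm{grad}\, f(x_{k-1}), \eta_{k-1} \rangle_{x_{k-1}}$
$\le \dfrac{c_2}{1 - c_2} \|\mathrm{grad}\, f(x_{k-1})\|_{x_{k-1}}^2$.    (4.20)

Using this inequality and the definition of $\beta_k$, we obtain the recurrence inequality for $\|\eta_k\|_{x_k}^2$
as follows:

$\|\eta_k\|_{x_k}^2 = \left\|-\mathrm{grad}\, f(x_k) + \beta_k \mathcal{T}^{(k-1)}_{\alpha_{k-1} \eta_{k-1}}(\eta_{k-1})\right\|_{x_k}^2$
$\le \|\mathrm{grad}\, f(x_k)\|_{x_k}^2 + 2\beta_k \left|\langle \mathrm{grad}\, f(x_k), \mathcal{T}^{(k-1)}_{\alpha_{k-1} \eta_{k-1}}(\eta_{k-1}) \rangle_{x_k}\right| + \beta_k^2 \left\|\mathcal{T}^{(k-1)}_{\alpha_{k-1} \eta_{k-1}}(\eta_{k-1})\right\|_{x_k}^2$
$\le \|\mathrm{grad}\, f(x_k)\|_{x_k}^2 + \dfrac{2c_2}{1 - c_2} \beta_k \|\mathrm{grad}\, f(x_{k-1})\|_{x_{k-1}}^2 + \beta_k^2 \|\eta_{k-1}\|_{x_{k-1}}^2$
$= c \|\mathrm{grad}\, f(x_k)\|_{x_k}^2 + \beta_k^2 \|\eta_{k-1}\|_{x_{k-1}}^2$,    (4.21)

where we have used the fact that $\|\mathcal{T}^{(k-1)}_{\alpha_{k-1} \eta_{k-1}}(\eta_{k-1})\|_{x_k} \le \|\eta_{k-1}\|_{x_{k-1}}$ and put
$c := (1 + c_2)/(1 - c_2) > 1$. The successive use of this inequality together with the definition
of $\beta_k$ results in

$\|\eta_k\|_{x_k}^2 \le c \left(\|\mathrm{grad}\, f(x_k)\|_{x_k}^2 + \beta_k^2 \|\mathrm{grad}\, f(x_{k-1})\|_{x_{k-1}}^2 + \cdots + \beta_k^2 \beta_{k-1}^2 \cdots \beta_2^2 \|\mathrm{grad}\, f(x_1)\|_{x_1}^2\right) + \beta_k^2 \beta_{k-1}^2 \cdots \beta_1^2 \|\eta_0\|_{x_0}^2$
$= c \|\mathrm{grad}\, f(x_k)\|_{x_k}^4 \left(\|\mathrm{grad}\, f(x_k)\|_{x_k}^{-2} + \|\mathrm{grad}\, f(x_{k-1})\|_{x_{k-1}}^{-2} + \cdots + \|\mathrm{grad}\, f(x_1)\|_{x_1}^{-2}\right) + \|\mathrm{grad}\, f(x_k)\|_{x_k}^4 \|\mathrm{grad}\, f(x_0)\|_{x_0}^{-2}$
$< c \|\mathrm{grad}\, f(x_k)\|_{x_k}^4 \sum_{j=0}^{k} \|\mathrm{grad}\, f(x_j)\|_{x_j}^{-2} \le \dfrac{c}{\gamma^2} \|\mathrm{grad}\, f(x_k)\|_{x_k}^4 (k + 1)$,    (4.22)

where use has been made of (4.17) in the last inequality. The inequality (4.22) gives rise
to

$\sum_{k=0}^{\infty} \dfrac{\|\mathrm{grad}\, f(x_k)\|_{x_k}^4}{\|\eta_k\|_{x_k}^2} \ge \dfrac{\gamma^2}{c} \sum_{k=0}^{\infty} \dfrac{1}{k + 1} = \infty$.    (4.23)

This contradicts (4.19) and the proof is completed. □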



5 Numerical experiments 

In this section, we compare Algorithm 3.2 with Algorithm 3.1 by numerical experiments. As
is shown in [7], if the vector transport $\mathcal{T}$ referred to in Algorithm 3.1 is the differentiated
retraction (3.12) and satisfies

$\|\mathcal{T}_{\eta_x}(\xi_x)\|_{R_x(\eta_x)} \le \|\xi_x\|_x$,  $\eta_x, \xi_x \in T_xM$,  $x \in M$,    (5.1)

the convergence property of this algorithm with the strong Wolfe step size for $0 < c_1 < c_2 < 1/2$
is proved. However, if (5.1) does not hold, it is not always ensured that sequences
generated by Algorithm 3.1 converge. In contrast with this, Algorithm 3.2 with $0 < c_1 < c_2 < 1/2$
indeed works well even if (5.1) fails to hold, as is stated in Thm. 4.2. In the
following numerical experiments, we give an example which shows that Algorithm 3.1 fails
but Algorithm 3.2 works well.

Consider the following Rayleigh quotient minimization problem on the sphere
$S^4 := \{x \in \mathbb{R}^5 \mid x^T x = 1\}$:

Problem 5.1.

minimize $f(x) = x^T A x$,    (5.2)
subject to $x \in S^4$,    (5.3)

where $A := \mathrm{diag}(1, 2, 3, 4, 5)$. A Riemannian metric $\langle \cdot, \cdot \rangle$ on $S^4$ is defined by

$\langle \xi_x, \eta_x \rangle_x := \xi_x^T G_x \eta_x$,  $\xi_x, \eta_x \in T_xS^4$,    (5.4)

where $G_x := \mathrm{diag}(1000 (x^{(1)})^2 + 1, 1, 1, 1, 1)$, and where $x^{(1)}$ denotes the first component of
the column vector $x$. It is to be noted that this metric is not the standard one on $S^4$.

The optimal solutions of this problem are $x_* = \pm(1, 0, 0, 0, 0)^T$. With respect to the metric
(5.4), the gradient of $f$ is described as

$\mathrm{grad}\, f(x) = 2\left(I - \dfrac{G_x^{-1} x x^T}{x^T G_x^{-1} x}\right) G_x^{-1} A x$.    (5.5)

Indeed, the right-hand side of (5.5) belongs to $T_xS^4 = \{\xi \in \mathbb{R}^5 \mid x^T \xi = 0\}$ and it holds that

$\left\langle 2\left(I - \dfrac{G_x^{-1} x x^T}{x^T G_x^{-1} x}\right) G_x^{-1} A x,\ \xi \right\rangle_x = 2 x^T A \xi = \mathrm{D}f(x)[\xi]$    (5.6)

for any $\xi \in T_xS^4$. Let $R$ be the retraction on $S^4$ defined by

$R_x(\xi) = \dfrac{x + \xi}{\sqrt{(x + \xi)^T (x + \xi)}}$,  $\xi \in T_xS^4$,  $x \in S^4$,    (5.7)

which is the special case of the QR retraction (A.3) on the Stiefel manifold defined in Appendix
A. For this $R$, the differentiated retraction (3.12) is chosen as the vector transport $\mathcal{T}$.
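
The two properties claimed for the gradient (5.5), tangency and the defining relation (5.6), can be checked numerically. The sketch below is an illustration only; the helper names (`metric_diag`, `inner`, `grad`) are assumptions, and the coefficient 1000 in $G_x$ is taken as printed in the text.

```python
A = [1.0, 2.0, 3.0, 4.0, 5.0]     # diagonal of A in Problem 5.1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def metric_diag(x):
    """Diagonal of G_x in the metric (5.4)."""
    return [1000.0 * x[0] ** 2 + 1.0, 1.0, 1.0, 1.0, 1.0]


def inner(x, u, v):
    """Inner product <u, v>_x = u^T G_x v."""
    return sum(g * a * b for g, a, b in zip(metric_diag(x), u, v))


def grad(x):
    """Gradient (5.5): 2 (I - G^{-1} x x^T / (x^T G^{-1} x)) G^{-1} A x."""
    g = metric_diag(x)
    u = [a * c / gi for a, c, gi in zip(A, x, g)]    # G^{-1} A x
    v = [c / gi for c, gi in zip(x, g)]              # G^{-1} x
    lam = dot(x, u) / dot(x, v)
    return [2.0 * (ui - lam * vi) for ui, vi in zip(u, v)]


x = [0.5, 0.5, 0.5, 0.5, 0.0]       # a point on S^4
xi = [1.0, -1.0, 0.0, 0.0, 2.0]     # a tangent vector: x^T xi = 0
tangency = dot(x, grad(x))                                   # should vanish
lhs = inner(x, grad(x), xi)                                  # <grad f(x), xi>_x
rhs = 2.0 * sum(a * c * d for a, c, d in zip(A, x, xi))      # 2 x^T A xi
```

Up to rounding, `tangency` is zero and `lhs` agrees with `rhs`, which is the content of (5.6) for this particular pair of a point and a tangent vector.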

We note that though the metric with which $S^4$ is endowed is not the standard one, the assumption
(4.2) holds, as is mentioned in Rem. A.2 in Appendix A. Hence from Thm. 4.2, Algorithm
3.2 works well in theory.

Figs. 5.1 and 5.2 show numerical results from applying Algorithm 3.1 with the strong Wolfe
step size to Problem 5.1 with the initial point $x_0 = (1, 1, 1, 1, 1)^T/\sqrt{5} \in S^4$, where
$0 < c_1 < c_2 < 1/2$. The vertical axes of Figs. 5.1 and 5.2 carry values of the first components of $x_k$
and values of the ratios $\|\mathcal{T}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} / \|\eta_k\|_{x_k}$, respectively. If
$\|\mathcal{T}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} / \|\eta_k\|_{x_k} \le 1$
for all $k \in \mathbb{N}$, the sequence $\{x_k\}$ would converge. However, as is observed in Fig. 5.2, the ratio
$\|\mathcal{T}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} / \|\eta_k\|_{x_k}$ intermittently exceeds the value 1. This fact prevents the sequence
from converging. To see this, we put these two figures together into Fig. 5.3. Fig. 5.3 shows
that the peaks of the two graphs are synchronized and then the sequence fails to approach the optimal
solution $(1, 0, 0, 0, 0)^T$ because of the violation of the inequality (5.1). This phenomenon happens,
since the first diagonal element of $G_x$ becomes large in the neighborhood of $(1, 0, 0, 0, 0)^T$.

Figure 5.1: The sequence of the first components $x_k^{(1)}$ from the sequence $\{x_k\}$ generated by
Algorithm 3.1 with the vector transport (3.12). (Horizontal axis: iteration.)

Figure 5.2: Ratios $\|\mathcal{T}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} / \|\eta_k\|_{x_k}$ evaluated from the sequences $\{x_k\}$ and $\{\eta_k\}$
generated by Algorithm 3.1 with the vector transport (3.12). (Horizontal axis: iteration.)

Figure 5.3: $x_k^{(1)}$ and $\|\mathcal{T}_{\alpha_k \eta_k}(\eta_k)\|_{x_{k+1}} / \|\eta_k\|_{x_k}$ by Algorithm 3.1 with the vector transport (3.12).

Figure 5.4: The sequence of the first components $x_k^{(1)}$ from the sequence $\{x_k\}$ generated by
Algorithm 3.2. (Horizontal axis: iteration.)

In contrast with this, in Algorithm 3.2 the vector transport $\mathcal{T}^R$ is scaled if necessary, and
the sequences thereby generated converge to solve Problem 5.1. In comparison with Fig. 5.1, the
sequence given in Fig. 5.4 shows that the present algorithm generates a converging sequence,
resolving the difficulty of being repelled from the optimal solution. We here note that (5.1)
is never violated in this algorithm.
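
A compact sketch of Algorithm 3.2 applied to Problem 5.1 is given below for orientation. It is an illustration under stated assumptions, not the code used for the experiments: the parameter values $c_1 = 10^{-4}$ and $c_2 = 0.1$, the stopping tolerance, and the doubling and bisection step-size search are choices made here, all function names are hypothetical, and the coefficient 1000 in $G_x$ is taken as printed in the text.

```python
import math

A = [1.0, 2.0, 3.0, 4.0, 5.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def metric_diag(x):
    return [1000.0 * x[0] ** 2 + 1.0, 1.0, 1.0, 1.0, 1.0]   # G_x in (5.4)

def inner(x, u, v):
    return sum(g * a * b for g, a, b in zip(metric_diag(x), u, v))

def f(x):
    return sum(a * c * c for a, c in zip(A, x))

def grad(x):
    """Riemannian gradient (5.5) with respect to the metric (5.4)."""
    g = metric_diag(x)
    u = [a * c / gi for a, c, gi in zip(A, x, g)]
    v = [c / gi for c, gi in zip(x, g)]
    lam = dot(x, u) / dot(x, v)
    return [2.0 * (ui - lam * vi) for ui, vi in zip(u, v)]

def retract(x, xi):
    """Retraction (5.7)."""
    w = [a + b for a, b in zip(x, xi)]
    n = math.sqrt(dot(w, w))
    return [c / n for c in w]

def transport(x, eta, xi):
    """Differentiated retraction (3.12) for the retraction (5.7)."""
    w = [a + b for a, b in zip(x, eta)]
    n = math.sqrt(dot(w, w))
    y = [c / n for c in w]
    s = dot(y, xi)
    return [(b - s * a) / n for a, b in zip(y, xi)]

def wolfe_step(x, eta, c1=1e-4, c2=0.1, max_iter=100):
    """Doubling/bisection search for a step satisfying (3.10) and (3.11)."""
    phi0, d0 = f(x), inner(x, grad(x), eta)
    lo, hi, alpha = 0.0, None, 1.0
    for _ in range(max_iter):
        step = [alpha * e for e in eta]
        y = retract(x, step)
        d = inner(y, grad(y), transport(x, step, eta))
        if f(y) > phi0 + c1 * alpha * d0 or d > c2 * abs(d0):
            hi = alpha
        elif d < -c2 * abs(d0):
            lo = alpha
        else:
            return alpha
        alpha = 0.5 * (lo + hi) if hi is not None else 2.0 * alpha
    raise RuntimeError("no strong Wolfe step found")

def cg_scaled(x0, iters=50, tol=1e-6):
    """Algorithm 3.2: scale the transported direction only if its norm grows."""
    x = list(x0)
    g = grad(x)
    eta = [-c for c in g]
    history = [f(x)]
    for _ in range(iters):
        gg = inner(x, g, g)
        if math.sqrt(gg) < tol:
            break
        alpha = wolfe_step(x, eta)
        step = [alpha * e for e in eta]
        x_new = retract(x, step)                       # (3.25)
        t = transport(x, step, eta)
        nt = math.sqrt(inner(x_new, t, t))
        ne = math.sqrt(inner(x, eta, eta))
        if nt > ne:                                    # lines 5-9 of Algorithm 3.2
            t = [ne / nt * c for c in t]
        g_new = grad(x_new)
        beta = inner(x_new, g_new, g_new) / gg         # (3.26)
        eta = [-a + beta * b for a, b in zip(g_new, t)]   # (3.27)
        x, g = x_new, g_new
        history.append(f(x))
    return x, history
```

Calling `cg_scaled([5 ** -0.5] * 5)` starts from the initial point used in the experiments above and returns the final iterate together with the recorded objective values. Removing the `if nt > ne` branch gives a sketch of Algorithm 3.1 with the unscaled transport (3.12), which allows the two variants to be compared on the same problem.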

Incidentally, we may use Algorithm 3.1 with the scaled vector transport (2.10). Since the
condition (5.1) holds for the scaled vector transport, the global convergence property of the
algorithm can be proved, if the condition (3.21) is satisfied. The numerical result is shown in
Fig. 5.5. The comparison of Fig. 5.5 and Fig. 5.4 shows that this algorithm is not as efficient
as Algorithm 3.2, even though a subsequence of $\{x_k^{(1)}\}$ seems to converge to 1 and hence the
corresponding subsequence of $\{x_k\}$ converges to the optimal solution.




Figure 5.5: The first components of the sequence $\{x_k\}$ generated by Algorithm 3.1
with the scaled vector transport (2.10) associated with (3.12).



6 Concluding Remarks 

We have dealt with the global convergence of the conjugate gradient method with the
Fletcher-Reeves $\beta$. Though the conjugate gradient method generates globally converging sequences
in the Euclidean space, the conjugate gradient method on a Riemannian manifold $M$ has
not been shown to have a convergence property in general. Under the assumption that
the vector transport does not increase the norm of the tangent vector, however, the convergence is
proved in [7]. If the parallel translation is adopted as $\mathcal{T}$, the conjugate gradient method is
shown to generate converging sequences as in [9]. However, the parallel translation is not
convenient for computational purposes. For computational efficiency, we have introduced a
vector transport in place of the parallel translation, with the modification that the vector
transport $\mathcal{T}$ is replaced by the scaled vector transport $\mathcal{T}^0$ only when $\mathcal{T}$ increases the norm of
the search direction vector. We have thereby achieved a balance between computational efficiency
and global convergence by proposing Algorithm 3.2. We have shown the convergence of
the present algorithm from both the theoretical and the numerical viewpoints. In particular,
we have performed numerical experiments to show that the present algorithm can solve the
problem for which the existing algorithm cannot work well because of the violation of the
assumption about the vector transport.



A Examples in which the condition (4.2) holds

In Thm. 4.1, we assume that the condition (4.2) holds. This assumption is far from impractical.
For example, the problem of minimizing the Brockett cost function on the Stiefel
manifold $\mathrm{St}(p,n)$ with the natural induced metric [1] has this property, as is shown below.

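Let $n, p$ be positive integers with $n \ge p$. The Stiefel manifold $\mathrm{St}(p,n)$ is defined to be
$\mathrm{St}(p,n) := \{X \in \mathbb{R}^{n \times p} \mid X^T X = I_p\}$. We consider $\mathrm{St}(p,n)$ as a Riemannian submanifold of
$\mathbb{R}^{n \times p}$ endowed with the natural induced metric

$$ \langle \xi, \eta \rangle_X := \operatorname{tr}(\xi^T \eta), \quad \xi, \eta \in T_X \mathrm{St}(p,n). \tag{A.1} $$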

Let $A$ be an $n \times n$ symmetric matrix and $N := \operatorname{diag}(\mu_1, \mu_2, \dots, \mu_p)$ with $0 < \mu_1 < \cdots < \mu_p$.
The Brockett cost function $f$ is defined on $\mathrm{St}(p,n)$ to be

$$ f(X) = \operatorname{tr}(X^T A X N). \tag{A.2} $$

Further, the QR decomposition-based retraction (which we call the QR retraction) $R$ is defined
to be

$$ R_X(\xi) := \operatorname{qf}(X + \xi), \quad \xi \in T_X \mathrm{St}(p,n), \ X \in \mathrm{St}(p,n), \tag{A.3} $$

where $\operatorname{qf}(B)$ denotes the Q-factor of the QR decomposition of a full rank matrix $B \in \mathbb{R}^{n \times p}$.
That is, if $B$ is decomposed into $B = QR$, where $Q \in \mathrm{St}(p,n)$ and $R$ is an upper triangular
$p \times p$ matrix with positive diagonal elements, then $\operatorname{qf}(B) = Q$.
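
As an illustration of (A.3), the Q-factor can be computed column by column with the Gram-Schmidt process. The sketch below is not taken from the paper: the function names `dot`, `qf` and `qr_retraction` and the column-by-column list storage are assumptions made for the example. Classical Gram-Schmidt is used only because it makes the positive diagonal of the triangular factor explicit. In floating-point practice a Householder-based QR routine is usually preferred.

```python
import math

def dot(u, w):
    """Euclidean inner product <u, w> = u^T w."""
    return sum(a * b for a, b in zip(u, w))

def qf(columns):
    """Q-factor of a full-rank matrix given as a list of column vectors.

    Classical Gram-Schmidt: each column is orthogonalized against the
    previously computed q_i and divided by a positive norm, so the implied
    upper triangular factor has positive diagonal entries.
    """
    q = []
    for b in columns:
        v = list(b)
        for qi in q:
            c = dot(qi, b)
            v = [vj - c * qj for vj, qj in zip(v, qi)]
        nrm = math.sqrt(dot(v, v))
        q.append([vj / nrm for vj in v])
    return q

def qr_retraction(X, xi, t=1.0):
    """R_X(t * xi) = qf(X + t * xi), with X and xi stored column by column."""
    moved = [[x + t * e for x, e in zip(xc, ec)] for xc, ec in zip(X, xi)]
    return qf(moved)

# X = [e1, e2] lies in St(2, 3); xi is tangent at X since X^T xi + xi^T X = 0.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
xi = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
Q = qr_retraction(X, xi, t=0.5)
print(Q)
```

The columns of the returned `Q` are orthonormal, and `qr_retraction(X, xi, t=0.0)` gives back `X`, as a retraction must.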

Proposition A.1. The inequality (4.2) holds for the Brockett cost function (A.2) on $M = \mathrm{St}(p,n)$,
where $\mathrm{St}(p,n)$ is endowed with the natural induced metric (A.1), and where the QR
retraction (A.3) is adopted.

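Proof. Since the function (A.2) is smooth, we have only to show that

$$ \left| \frac{d^2}{dt^2} (f \circ R_X)(t\eta) \right| \le L, \quad \eta \in T_X \mathrm{St}(p,n) \ \text{with} \ \|\eta\|_X = 1, \ X \in \mathrm{St}(p,n), \ t \ge 0. \tag{A.4} $$

In fact, Eq. (4.2) is a straightforward consequence of this inequality. Let $Q(t)$ be the curve
defined by $Q(t) = R_X(t\eta) = \operatorname{qf}(X + t\eta)$, and let $x_k$, $\eta_k$, $q_k(t)$ denote the $k$-th column vectors of $X$, $\eta$, $Q(t)$,
respectively. Then, through the Gram-Schmidt orthonormalization process, we obtain

$$ q_k(t) = \frac{x_k + t\eta_k - \sum_{i=1}^{k-1} \langle q_i(t), x_k + t\eta_k \rangle \, q_i(t)}{\left\| x_k + t\eta_k - \sum_{i=1}^{k-1} \langle q_i(t), x_k + t\eta_k \rangle \, q_i(t) \right\|}, \tag{A.5} $$

where $\langle a, b \rangle := a^T b$ and $\|a\| := \sqrt{\langle a, a \rangle}$ for $n$-dimensional vectors $a, b$. By induction on $k$, we
can take vector-valued polynomials $g_k(t)$ in $t$ satisfying

$$ q_k(t) = \frac{g_k(t)}{\|g_k(t)\|}, \quad t \ge 0. \tag{A.6} $$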
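Indeed, for $k = 1$, (A.6) holds with $g_1(t) = x_1 + t\eta_1$. Suppose that (A.6) holds for $1, \dots, k-1$.
Then, substituting $q_i(t) = g_i(t)/\|g_i(t)\|$ into the Gram-Schmidt formula above, we can write out $q_k(t)$ as

$$ q_k(t) = \frac{x_k + t\eta_k - \sum_{i=1}^{k-1} \|g_i(t)\|^{-2} \langle g_i(t), x_k + t\eta_k \rangle \, g_i(t)}{\left\| x_k + t\eta_k - \sum_{i=1}^{k-1} \|g_i(t)\|^{-2} \langle g_i(t), x_k + t\eta_k \rangle \, g_i(t) \right\|}. $$

Multiplying the numerator and the denominator by $\prod_{j=1}^{k-1} \|g_j(t)\|^2$, we then have

$$ q_k(t) = \frac{\prod_{j=1}^{k-1} \|g_j(t)\|^2 \, (x_k + t\eta_k) - \sum_{i=1}^{k-1} \prod_{j \neq i} \|g_j(t)\|^2 \, \langle g_i(t), x_k + t\eta_k \rangle \, g_i(t)}{\left\| \prod_{j=1}^{k-1} \|g_j(t)\|^2 \, (x_k + t\eta_k) - \sum_{i=1}^{k-1} \prod_{j \neq i} \|g_j(t)\|^2 \, \langle g_i(t), x_k + t\eta_k \rangle \, g_i(t) \right\|}. \tag{A.7} $$

Denoting by $g_k(t)$ the numerator of the right-hand side of (A.7), which is a polynomial in $t$,
we obtain (A.6).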
Let

$$ h(X, \eta, t) = \frac{d^2}{dt^2} (f \circ R_X)(t\eta). \tag{A.8} $$

Then we have

$$ h(X, \eta, t) = \sum_{k=1}^{p} \mu_k \frac{d^2}{dt^2} \left( q_k(t)^T A \, q_k(t) \right). \tag{A.9} $$

Since $q_k(t)^T A q_k(t) = g_k(t)^T A g_k(t) / \|g_k(t)\|^2$, and since the degree of the numerator polynomial
in $t$ is not more than that of the denominator polynomial, the degree of the numerator
polynomial from the right-hand side of (A.9), after the differentiation is carried out, is less than
that of the denominator polynomial, so that one has

$$ \lim_{t \to \infty} h(X, \eta, t) = 0. \tag{A.10} $$

This implies that $h(X, \eta, t)$ is bounded with respect to $t \ge 0$. Moreover, $h(X, \eta, t)$ is
continuous with respect to $X$ and $\eta$ on the compact set $\{(X, \eta) \in T\mathrm{St}(p,n) \mid \|\eta\|_X = 1\}$. It
then turns out that $h(X, \eta, t)$ is bounded on the whole domain, which implies that there exists
$L > 0$ such that (A.4) holds. This completes the proof. □
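
The boundedness and the decay stated in the proof can be checked numerically on a small instance. The following sketch is only a sanity check under stated assumptions, not part of the proof: the second derivative is replaced by a central difference with a hypothetical step size `step`, and the helper names `dot`, `matvec`, `qf`, `brockett` and `h` are introduced for the example. For the data chosen below ($p = 1$, $A = \operatorname{diag}(1, 2)$, $x = e_1$, $\eta = e_2$) one has $f(R_x(t\eta)) = 2 - 1/(1+t^2)$, whose second derivative $(2 - 6t^2)/(1+t^2)^3$ is bounded by 2 in absolute value and tends to 0 as $t \to \infty$.

```python
import math

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def matvec(A, v):
    """Product of a matrix (list of rows) with a vector."""
    return [dot(row, v) for row in A]

def qf(columns):
    """Q-factor via classical Gram-Schmidt; matrix stored column by column."""
    q = []
    for b in columns:
        v = list(b)
        for qi in q:
            c = dot(qi, b)
            v = [vj - c * qj for vj, qj in zip(v, qi)]
        nrm = math.sqrt(dot(v, v))
        q.append([vj / nrm for vj in v])
    return q

def brockett(A, mu, Q):
    """f(Q) = tr(Q^T A Q N) = sum_k mu_k * q_k^T A q_k for the columns q_k of Q."""
    return sum(m * dot(q, matvec(A, q)) for m, q in zip(mu, Q))

def h(A, mu, X, eta, t, step=1e-3):
    """Central-difference estimate of d^2/dt^2 f(R_X(t * eta))."""
    def phi(s):
        moved = [[x + s * e for x, e in zip(xc, ec)] for xc, ec in zip(X, eta)]
        return brockett(A, mu, qf(moved))
    return (phi(t + step) - 2.0 * phi(t) + phi(t - step)) / step ** 2

A = [[1.0, 0.0], [0.0, 2.0]]
mu = [1.0]
X = [[1.0, 0.0]]      # single column x = e1 on the unit circle
eta = [[0.0, 1.0]]    # unit tangent vector at x
for t in (0.0, 1.0, 10.0, 100.0):
    print(t, h(A, mu, X, eta, t))
```

The printed estimates shrink in magnitude as $t$ grows, in line with (A.10). For very large $t$ the central difference is dominated by rounding error, so only the order of magnitude is meaningful there.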

Remark A.1. Reviewing the proof, we observe that since the QR retraction is independent of
the metric with which $\mathrm{St}(p,n)$ is endowed, and since the set $\{(X, \eta) \in T\mathrm{St}(p,n) \mid \|\eta\|_X = 1\}$
is compact with respect to any metric on $\mathrm{St}(p,n)$, the inequality (4.2) with $R$ being the QR
retraction (A.3) holds for the Brockett cost function (A.2) independently of the choice of a
metric.

Remark A.2. We also note that Prop. A.1 and Rem. A.1 cover both the Rayleigh quotient
on the sphere $S^{n-1}$ as $p = 1$ and the Brockett cost function on the orthogonal group as $p = n$.
In particular, the inequality (4.2) holds for the function (5.2), though the sphere $S^4$ is endowed
with the non-standard metric introduced in Section 5.



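Another example for (4.2) comes from the problem of minimizing the function

$$ F(U, V) = \operatorname{tr}(U^T A V N) \tag{A.11} $$

on $\mathrm{St}(p,m) \times \mathrm{St}(p,n)$, where $A$ is an $m \times n$ matrix and $N = \operatorname{diag}(\mu_1, \dots, \mu_p)$ with $\mu_1 > \cdots > \mu_p > 0$.
An optimal solution to this problem gives the singular value decomposition of $A$
[8]. Let $m, n, p$ be positive integers with $m \ge n \ge p$. We consider $\mathrm{St}(p,m) \times \mathrm{St}(p,n)$ as a
Riemannian submanifold of $\mathbb{R}^{m \times p} \times \mathbb{R}^{n \times p}$ endowed with the natural induced metric

$$ \langle (\xi_1, \eta_1), (\xi_2, \eta_2) \rangle_{(U,V)} := \operatorname{tr}(\xi_1^T \xi_2) + \operatorname{tr}(\eta_1^T \eta_2), \quad (\xi_1, \eta_1), (\xi_2, \eta_2) \in T_{(U,V)}(\mathrm{St}(p,m) \times \mathrm{St}(p,n)). \tag{A.12} $$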
As in the previous example on $\mathrm{St}(p,n)$, the QR retraction on $\mathrm{St}(p,m) \times \mathrm{St}(p,n)$ is defined by

$$ R_{(U,V)}(\xi, \eta) := (\operatorname{qf}(U + \xi), \operatorname{qf}(V + \eta)), \quad (\xi, \eta) \in T_{(U,V)}(\mathrm{St}(p,m) \times \mathrm{St}(p,n)), \tag{A.13} $$

for $(U, V) \in \mathrm{St}(p,m) \times \mathrm{St}(p,n)$.
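
The product retraction (A.13) simply applies the Q-factor map to each component, and the cost (A.11) reduces to $\sum_{k} \mu_k u_k^T A v_k$ over the columns $u_k$, $v_k$ of $U$, $V$. A minimal sketch follows, with hypothetical helper names (`dot`, `qf`, `shift`, `product_retraction`, `F`) and the same column-by-column storage assumed purely for illustration.

```python
import math

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def qf(columns):
    """Q-factor via classical Gram-Schmidt; matrix stored column by column."""
    q = []
    for b in columns:
        v = list(b)
        for qi in q:
            c = dot(qi, b)
            v = [vj - c * qj for vj, qj in zip(v, qi)]
        nrm = math.sqrt(dot(v, v))
        q.append([vj / nrm for vj in v])
    return q

def shift(Y, z, t):
    """Return Y + t * z for matrices stored column by column."""
    return [[y + t * e for y, e in zip(yc, zc)] for yc, zc in zip(Y, z)]

def product_retraction(U, V, xi, eta, t=1.0):
    """R_{(U,V)}(t * (xi, eta)) = (qf(U + t * xi), qf(V + t * eta))."""
    return qf(shift(U, xi, t)), qf(shift(V, eta, t))

def F(A, mu, U, V):
    """F(U, V) = tr(U^T A V N) = sum_k mu_k * u_k^T A v_k (A given by rows)."""
    return sum(m * dot(u, [dot(row, v) for row in A])
               for m, u, v in zip(mu, U, V))

# A is 3 x 2, p = 1; (U, V) = (e1, e1) and (xi, eta) = (e2, e2) is tangent.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mu = [1.0]
U, V = [[1.0, 0.0, 0.0]], [[1.0, 0.0]]
xi, eta = [[0.0, 1.0, 0.0]], [[0.0, 1.0]]
U1, V1 = product_retraction(U, V, xi, eta, t=1.0)
print(F(A, mu, U, V), F(A, mu, U1, V1))
```

Because the two components are retracted independently, the column-wise polynomial representation obtained for $\mathrm{St}(p,n)$ applies to each factor separately.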

Proposition A.2. The inequality (4.2) holds for the objective function (A.11) on
$M = \mathrm{St}(p,m) \times \mathrm{St}(p,n)$, where $M$ is endowed with the natural induced metric (A.12) and with
the retraction (A.13).



Proof. We shall show that

$$ \left| \frac{d^2}{dt^2} (F \circ R_{(U,V)})(t(\xi, \eta)) \right| \le L \tag{A.14} $$

for $(\xi, \eta) \in T_{(U,V)}(\mathrm{St}(p,m) \times \mathrm{St}(p,n))$ with $\|(\xi, \eta)\|_{(U,V)} = 1$, $(U, V) \in \mathrm{St}(p,m) \times \mathrm{St}(p,n)$, $t \ge 0$.
Put $Q(t) = \operatorname{qf}(U + t\xi)$, $S(t) = \operatorname{qf}(V + t\eta)$. Let $q_k(t)$ and $s_k(t)$ denote the $k$-th column
vectors of $Q(t)$ and $S(t)$, respectively. From Prop. A.1 and the course of its proof, there exist
vector-valued polynomials $g_k(t)$ and $h_k(t)$ such that

$$ q_k(t) = \frac{g_k(t)}{\|g_k(t)\|}, \quad s_k(t) = \frac{h_k(t)}{\|h_k(t)\|}. \tag{A.15} $$

Let

$$ H(U, V, \xi, \eta, t) = \frac{d^2}{dt^2} (F \circ R_{(U,V)})(t(\xi, \eta)). \tag{A.16} $$

Then we have

$$ H(U, V, \xi, \eta, t) = \sum_{k=1}^{p} \mu_k \frac{d^2}{dt^2} \left( q_k(t)^T A \, s_k(t) \right). \tag{A.17} $$

Since $q_k(t)^T A s_k(t) = g_k(t)^T A h_k(t) / (\|g_k(t)\| \, \|h_k(t)\|)$, by the same reasoning as that for $h(X, \eta, t)$
in Prop. A.1, we have

$$ \lim_{t \to \infty} H(U, V, \xi, \eta, t) = 0, \tag{A.18} $$

so that $H(U, V, \xi, \eta, t)$ is bounded with respect to $t \ge 0$. Further, $H(U, V, \xi, \eta, t)$ is continuous
with respect to $(U, V, \xi, \eta)$ on the compact set
$\{(U, V, \xi, \eta) \in T(\mathrm{St}(p,m) \times \mathrm{St}(p,n)) \mid \|(\xi, \eta)\|_{(U,V)} = 1\}$. Hence $H(U, V, \xi, \eta, t)$ is bounded on
the whole domain. This completes the proof. □

A remark similar to Rem. A.1 can be made on the metric with which $\mathrm{St}(p,m) \times \mathrm{St}(p,n)$
is endowed: the validity of (4.2) is independent of the choice of a metric.